Bioinformatics for Computational Biology — Appunti TiTilda

Introduction

Hello reader, this is a short summary of our notes for the course “FAI”. If you find an error or think that a point isn’t clear, please tell me and I will fix it (sorry for my bad English). -NP

Chapter One: The cell

Cell: the basic unit of living beings.

Cells are divided into:

Prokaryotes: do not have a membrane-bound nucleus; their genetic material is not clearly separated from the rest of the cellular matter. They are unicellular organisms.

Structure:

Eukaryotes: cells enclosed by a plasma membrane that contains:

1.1 The Big Four

All cells are made up of 4 main types of macromolecules:

Amino acid structure:

1 central carbon atom (C) linked with:

Nucleic acids:

DNA: contains ALL the genetic information necessary for the life of the host organism. It is organized in chromosomes.

Prokaryotes typically have a single chromosome.

A cell can be haploid, diploid, triploid, …

This means that the number of chromosomes is n, 2n, 3n, …

1.2 Mitosis

Mitosis -> Asexual reproduction for a single cell.

“Mitosis”

1.3 Meiosis

Meiosis -> Sexual Reproduction between:

“Meiosis”

Chapter Two: Mendelian Genetics

In 1865 Gregor Mendel formulated the laws of inheritance.

2.1 First Law: Law of dominance

Mendel crossed plants that differed in only one discrete character, one pure dominant (RR) and one pure recessive (rr).

The result he obtained was that all four offspring displayed only the dominant trait.

2.2 Second Law: law of segregation

“Second generation”

Ratio is 3:1

3 dominant;

1 recessive.

2.3 Third law: law of trait independent segregation

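In F2, with 2 traits, the ratio is (3:1) * (3:1) = 9:3:3:1

9 dominant for both traits;

3 dominant for the first trait and recessive for the second;

3 recessive for the first trait and dominant for the second;

1 recessive for both traits.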
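The 9:3:3:1 ratio can be checked by enumerating all the crosses between two double heterozygotes. This is only a minimal sketch: the genotype `RrYy` (with `Y`/`y` as a hypothetical second trait) and the function names are assumptions made for illustration, not part of the course material.

```python
from collections import Counter
from itertools import product


def gametes(genotype):
    """Gametes of a two-gene genotype such as 'RrYy': one allele per gene."""
    first_gene, second_gene = genotype[:2], genotype[2:]
    return [a + b for a, b in product(first_gene, second_gene)]


def phenotype(gamete1, gamete2):
    """'D' if at least one dominant (uppercase) allele is present, else 'r'."""
    return "".join(
        "D" if a.isupper() or b.isupper() else "r"
        for a, b in zip(gamete1, gamete2)
    )


# Cross RrYy x RrYy: 4 gametes per parent, 16 equally likely combinations.
parent = "RrYy"
counts = Counter(
    phenotype(g1, g2) for g1 in gametes(parent) for g2 in gametes(parent)
)
print(counts)  # DD: 9, Dr: 3, rD: 3, rr: 1
```

Here `DD` means dominant phenotype for both traits, `Dr` and `rD` the two mixed cases and `rr` recessive for both.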

The second law shows that:

Linkage: association between genes or groups of genes.

Thomas H. Morgan demonstrated that the association among different genes on a chromosome exists, but is not total.

Crossing-over: during meiosis, homologous chromosomes can exchange genetic material.

1 centimorgan \implies 1\% of recombinant chromosomes in the offspring, used as a measure of the relative distance between a pair of genes.

The percentage of recombination between two genes is proportional to their relative distance.

2.4 Sexual characters

Inheritance of genetic characters has particular importance for genes located on sexual chromosomes.

Sexual chromosomes: X and Y

In humans, genes located on X and Y are associated with sex.

There are traits influenced by sex and traits limited by sex.

Some pathologies linked to genetic alterations on X or Y:

For genetic transmission, the normal allele is dominant and the mutated one is recessive.

Genes can interact, generating unpredicted phenotypes and anomalies, like:

Phenotype: group of observable characteristics.

Phenotype = genotype + environment

The same genotype can express different phenotypes. Example: a female bee with the same chromosome complement as the others becomes a queen bee only if fed with royal jelly; otherwise it becomes a worker bee.

1908, G. H. Hardy and W. Weinberg \implies in a population at equilibrium, the frequencies of genes and genotypes tend to remain constant.

Given the distribution of paired A and a alleles, we want to know their relative frequencies.

Pairs: AA, Aa, aa -> the frequency of A is not immediately known because A is present both in AA and in Aa.

p: frequency of A.

q: frequency of a.

p+q = 100 \% \implies p = 1 - q \rightleftarrows q = 1 - p

AA: p^2

aa: q^2

Aa: 2pq

So, since AA + 2Aa + aa = (A+a)^2 = 1:

p^2 + 2pq + q^2 = (p+q)^2 = 1.

Now we show that the allele frequencies remain constant in the next generation.

Knowing that the frequency of AA is p^2 and that half of the alleles in Aa (frequency 2pq) are A, we can say that f_A = p^2 + pq; since q = 1 - p:

\begin{align*} f_A = p^2 + p(1-p) = p \end{align*}

The same reasoning applies to f_a, which gives q.

When this happens we say that the population is balanced.
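The relations above can be verified numerically. The following is a small sketch, assuming an arbitrary starting value p = 0.7 (the value and the function names are illustrative only): it computes the genotype frequencies and checks that the frequency of A recovered from them is again p.

```python
def genotype_frequencies(p):
    """Genotype frequencies at equilibrium for allele frequencies p and q = 1 - p."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def allele_frequency_A(freqs):
    """Frequency of A: all alleles of AA plus half of the alleles of Aa."""
    return freqs["AA"] + freqs["Aa"] / 2


p = 0.7  # assumed example value
freqs = genotype_frequencies(p)
next_p = allele_frequency_A(freqs)

print(freqs)   # AA = 0.49, Aa = 0.42, aa = 0.09 (up to rounding)
print(next_p)  # 0.7 (up to rounding): the allele frequency is unchanged
```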

Fitness: The measure of reproductive ability of an individual.

Chapter Three: Molecular genetics

3.1 DNA

DNA (Deoxyribonucleic Acid)

DNA chain: long sequence of nucleotides linked by the bond between the phosphate group of one nucleotide and the sugar of the next nucleotide.

This bond is called the 3'-5' phosphodiester bond.

In 1953, Watson and Crick defined the exact spatial structure of DNA by considering two experimental results:

Around 1950 Chargaff found that in the DNA of each organism:

In the early 1950s Rosalind Franklin and M. Wilkins obtained X-ray diffraction photographs of crystals of pure DNA fibers, which showed:

From this information:

bp: base pair \implies number of base pairs in one regular repeat (turn) of the helix

DNA structure types:

Each human cell \to about 1 m of DNA \implies packaging, a mechanism that strongly compresses the DNA.

3.2 RNA Structure

RNA is the protagonist in the synthesis of proteins.

Structural differences between RNA and DNA:

In some living beings without DNA, RNA plays a leading role in the reproduction process.

In cells, there are different types of RNA:

Of all these, mRNAs, rRNAs and tRNAs play important roles in protein synthesis, the others in regulation.

3.3 Virus

Genome: genetic material of an organism.

Generally indicates DNA, often also RNA and proteins.

Viruses: the simplest life forms -> cellular parasites.

They must use another cell to reproduce themselves.

Virus genome: 1 molecule of nucleic acid (DNA or RNA) enclosed in a protein shell (capsid), which can have different shapes.

Viruses can be divided into 3 classes:

Bacteriophages: capsid with an icosahedral head, containing the genetic material, connected to a hollow cylinder (tail) to which filamentous structures (spikes) are linked; these allow the virus to attach to the bacterial cell wall.

Once attached, the virus injects its genetic material into the cell, where it reproduces itself.

3.3.1 Virus of eukaryote cells

Cellular transformation: phenomenon resulting from the integration of a virus into the host cell’s DNA.

3.3.2 Retrovirus

Retrovirus: e.g Human Immunodeficiency Virus (HIV).

\begin{align*} \text{Viral RNA} \xrightarrow{\text{reverse transcriptase}} \text{Viral DNA} \end{align*}

The viral DNA is then converted into a double helix by another enzyme active in the cell nucleus, DNA polymerase.

\implies the viral genome can integrate into the host cell’s genome and reproduce itself.

3.4 Bacterial genome

Bacterial cells (prokaryotes) don’t have a defined nucleus, but a compact structure (nucleoid):

In many bacteria, there are also small circular molecules of DNA \to plasmids

There are different variants of plasmids:

3.5 Genome of Eukaryote

Chromosomes \xrightarrow{constituted} chromatin \implies

Proteins that are tightly bound to DNA are called histones.

Chromatin is structured like a necklace:

Context: cellular division

Metaphase: chromosomes assume the X shape:

The position of the centromere, the length of the chromatids and the size of the chromosomes identify the different chromosomes -> karyotype of the organism.

We can highlight the areas rich in A and T bases;

the areas rich in C and G remain pale.

This generates a striped arrangement -> bands

Nomenclature: Each band has a specific nomenclature (e.g. 6p21.3)

Not all the genetic material of eukaryotes is in the nucleus.

A small fraction of circular DNA is in cellular organelles:

Extra-nuclear genes in the cytoplasm are transmitted to the offspring only by the ovum -> maternal inheritance

To guarantee transmission, the DNA is copied -> duplication

The process is the same in eukaryotes and prokaryotes and is semi-conservative -> each daughter molecule has 1 strand of DNA from the mother molecule and 1 newly synthesized strand.

Semi-conservative duplication:

The 2 double helices consist of:

Replication fork: Area where double helix opens and synthesis starts.

Duplication happens at specific positions, one at a time (not concurrently):

  1. A specific enzyme opens the double helix locally.

  2. Copying by base pairing and polymerization of nucleotides.

    \xrightarrow[\text{enzymes}]{} DNA polymerase and Ⅰ \implies both directions

    In the 5' to 3' direction, copying is continuous.

    In the 3' to 5' direction, copying is discontinuous: we have Okazaki fragments, segments of DNA that are synthesized separately and then linked together by the enzyme DNA ligase.

  3. Re-closing of the double helix by specific enzymes.

3.6 Gene structure

Codon: triplet of bases.

But how is a gene structured?

start codon: triplet that marks the beginning of the coding sequence of the gene.

stop codon: triplet that marks the end of the coding sequence of the gene.

Human genome -> 3000 Mb (megabases) \implies 22k - 25k genes.

Central Dogma of Molecular Biology (Crick 1958)

Transcription of DNA to RNA proceeds from 5' to 3' \implies only one strand of the DNA is used -> template strand

Synthesis is catalyzed by enzyme RNA polymerase -> different in prokaryotes and eukaryotes.

3 phases:

  1. Starting transcription

    • RNA polymerase binds (with its \sigma factor) to the gene’s promoter.
    • \sigma detaches and transcription starts
  2. Polymerization of the RNA polynucleotide -> elongation

  3. Detachment of the synthesized RNA and end of transcription.

Transcription factors: proteins;

Specific transcription factors are necessary for each polymerase.

RNA polymerase synthesizes RNA in a continuous way.

Splicing: in eukaryotes, removal of introns from pre-RNA -> mature RNA

There are different types of splicing:

In a gene with alternative splicing, the majority of exons are always included in the final mRNA.

There are 4 types:

Splicing is not a fully understood process.

In some cases, small nuclear ribonucleoproteins (snRNPs) cut:

  1. At the 5' end of the intron, at the dinucleotide GU
  2. At the 3' end, at the dinucleotide AG
  3. The exons are then linked together.

Some introns follow the GU-AG rule without snRNPs -> self-splicing -> these RNAs are called ribozymes

After splicing -> stabilizing mRNA by adding:

3.7 Different types of RNA

Structural and functional components of ribosomes, where the synthesis of proteins occurs.

rRNA, tRNA and mRNA participate in translation.

3.8 Genetic Code

Genetic code: set of rules defining how the information in the nucleotide sequence of mRNA (4 bases A, G, C, U) is translated into the amino acid sequence of the encoded protein (20 amino acids).

Each codon:

Feature of genetic code

Besides these characteristics of the triplets, concurrent encodings can arise because of splicing, and mostly alternative splicing.

Some alternative transcripts are tissue-specific \implies expressed only in one specific type of cell.

The mechanisms of the genetic code and alternative splicing allow the encoding and production of many proteins with different functions from the same DNA.

3.9 Translation

Translation: complex process involving many cellular components: rRNA, mRNA and tRNA.

tRNAs are the adaptors between the nucleotides of mRNA and the amino acids of the protein:

Translation is the same for prokaryotes and eukaryotes and has 3 phases:

  1. Start
    • The ribosome binds to the mRNA at the start triplet (AUG);
    • The mRNA’s AUG triplet is identified by the specific complementary tRNA triplet (anticodon)
    • The tRNA that carries the amino acid corresponding to the AUG triplet (Met) binds
  2. Synthesis:
    • The process goes on;
    • The ribosome moves along the mRNA;
    • Only 1 triplet at a time is available for binding to its specific tRNA;
    • The amino acids brought by the tRNAs are adjacent;
    • When the ribosome moves, a peptide bond is created between the last amino acid transported and the previous one.
    • The protein chain extends as the ribosome moves.
  3. End:
    • When the ribosome reaches a stop triplet (UAA, UAG, UGA):
      • It detaches from the mRNA.
      • It sets the protein chain free.

Each ribosome builds only 1 protein at a time.

In bacteria (prokaryotes), requiring synthesis of many copies of the same protein in short time (minutes):

In bacteria transcription and translation are coupled.

Central Dogma

Not all genes are always necessary for the life of a cell -> only the constitutive ones (necessary for the life of the cell) are always expressed; the others are expressed when necessary.

3.10 Genetic Expression

Expression of genes is controlled by cellular needs: environment conditions and functions to execute.

Multi-cellular organisms:

Bacteria

François Jacob and Jacques Monod (1960 - 64) studied the use of lactose in E. coli.

Lactose is a disaccharide (sugar made of 2 monomers, glucose and galactose) that can be utilized once it is split into its 2 components inside the cell.

The splitting of lactose is carried out by enzymes encoded by 3 genes:

In the absence of lactose, the cell contains \cong 5 molecules of each enzyme.

If lactose is the only source of energy, the synthesis of the enzymes is rapidly stimulated \implies inducible enzymes

Genes lacZ, lacY and lacA -> structural genes; they are consecutive on the bacterial chromosome and transcribed in the same mRNA.

Before these three there is lacI, which regulates them; its elimination leads to continuous synthesis of the 3 enzymes.

Mechanism of regulation

The repressor bound to the operator prevents RNA polymerase from transcribing the 3 structural genes.

If lactose is present, it binds to the repressor and changes its 3D conformation, preventing it from binding to the operator.

When lactose is totally consumed, the repressor binds again to the operator and synthesis stops.

Superior Organisms

Main mechanisms are similar but regulation is more complex.

Gene expression is regulated by proteins, the transcription factors, which bind DNA sites upstream of the gene (Transcription Factor Binding Sites) and can allow or block the binding of RNA polymerase to the gene’s promoter.

Example

The protein metallothionein protects cells from the toxic effect of metals free in the environment:

The gene of metallothionein is transcribed by RNA polymerase II

Many stretches of DNA upstream of the gene are involved in its expression:

Such regions, the metal response elements, modulate transcription based on the concentration of metals.

Transcription factors have a leading role in regulation:

Most common structures \implies helix-turn-helix and zinc-finger

3.11 Proteins

Proteins: macro-polymers formed by linking amino acids (minimum 3); there are 20 amino acids:

Peptide: short polymer of amino acids linked by peptide bonds.

Peptide bond: bond between the C-terminus (carboxyl group) of one amino acid and the N-terminus (amino group) of the next -> planar and rigid \implies NOT a rotatable bond

Polypeptides: have 1 free N-terminus (beginning) and 1 free C-terminus (end) -> contain from 3 to several hundred amino acids.

Proteins have different functions, essential for all organisms:

The function executed by a protein depends on the properties of the protein, determined by:

3.12 Proteins’ structure

Protein structure is organized in 4 related levels:

Isoforms: two proteins that differ in small details, due to alternative splicing or to polymorphisms.

3.13 Genetic mutations

During the duplication of DNA, variations in the sequence of nucleotide bases (mutations) can occur and be transmitted to the offspring (mutants):

Single Nucleotide Polymorphism, or SNP: is the variation of 1 single nucleotide in an individual’s DNA sequence.

\begin{align*} \text{CCU} \to \text{Pro} &\xrightarrow{\text{SNP}} \text{CCC} \to \text{Pro} \quad \text{(in this case nothing changes)} \\ \text{AAG} \to \text{Lys} &\xrightarrow{\text{SNP}} \text{GAG} \to \text{Glu} \quad \text{(in this case the SNP changes the protein synthesized)} \end{align*}

SNPs are likely to be good biological markers.

3.14 Types of genetic mutations

3 classes of mutations:

3.14.1 Genetic Mutations

Genetic mutations can derive from different alterations:

In the substitution scenario a situation like A-T \xrightarrow{substitution} G-T is possible; in this case G-T is an unstable pair, and in the next replication we can have G-C or A-T

3.14.2 Chromosomal mutations

Chromosomal mutations: changes in chromosomal structure compared to normal karyotype

Main types of chromosomal anomalies:

Less harmful than deletion

!!! In humans, involved in tumor onset !!!

3.14.3 Genomic Mutations

Genomic mutations concern total number of chromosomes in each cell of an individual

Example:

They derive from errors in the meiotic process, like the failed disjunction of a pair of homologous chromosomes:

3.15 Mutagens

Frequency of mutations can increase if organism is exposed to substances and radiations (mutagens) that interact with DNA and can induce changes in nucleotide sequence.

3.16 Fixing DNA and Genome

All living beings have various cellular mechanisms for repairing DNA damage:

In humans, the lack or reduction of one or more of the enzymes involved is associated with an inherited pathology that leads to the formation of skin tumors caused by the ultraviolet radiation present in sunlight.

Genome: the entire genetic material of an organism

In bioinformatics, genomic data/information: the set of available data and information related to the genetic material of an organism.

Transcriptome: the set of all possible transcripts of an organism.

Proteome: the set of all possible proteins of an organism, deriving from the different transcripts.

3.17 Evolutionary biology

Evolutionary biology is a sub-field of biology regarding the origin of species from a common ancestor, as well as their changes, multiplications and diversifications over time.

Then:

Today:

Sequence regions that are homologous are also called conserved

Sequence homology may indicate common function

Homologous sequences are said to be orthologous if they were separated by a speciation event.

Homologous sequences are said to be paralogous if they were separated by a gene duplication event.

So, homologous sequences can be divided into two groups:

Phylogenesis (or phylogenetics): study of the evolution of life

Taxonomy: classification of organisms depending on similarities.

Phylogenetic trees: diagrams that show the relations of common descent of taxonomic groups of organisms.

Computational phylogenetics: concerns the construction of phylogenetic trees and the study of the anatomic, biochemical, genetic and paleontological data used for their construction.

Phylogenetic trees are built on the basis of a large number of genetic sequences.

Many techniques are used to identify the best tree -> complexity NP (nondeterministic polynomial time)

Phylogenetic trees are important but have some limits:

Chapter Four: Biomolecular Sequence Analysis

Why do we do sequence comparison ?

Two types of alignment:

Different techniques:

4.1 Alignment of 2 sequences

4.1.1 Dot matrix

The simplest one: we build a matrix with sequence 1 as columns and sequence 2 as rows, and we put an “x” wherever we have a match.

Filtering of background noise

Pros:

Cons:

Practice

4.1.2 Pairwise alignment

Also simple; it’s based on 3 types of action:

We write the two sequences and compare them.

-: gap

C \\ | \\ C: match

G \\ | \\ C: mismatch

We can assign a score, using:

gap = - 2

mismatch = -1

match = +2

The higher the score, the better the alignment.
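As a minimal sketch of how such a score is computed, the function below sums the three values above over the columns of an alignment that is already written out (with "-" for gaps). The function name is an assumption for illustration.

```python
MATCH, MISMATCH, GAP = 2, -1, -2  # values used in these notes


def alignment_score(row1, row2):
    """Score two aligned rows of equal length, column by column."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += GAP
        elif a == b:
            score += MATCH
        else:
            score += MISMATCH
    return score


print(alignment_score("ATGC", "ATCC"))  # 2 + 2 - 1 + 2 = 5
print(alignment_score("ATGC", "AT-C"))  # 2 + 2 - 2 + 2 = 4
```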

Distance between two strings:

Now we will talk about the scores assigned to gaps, mismatches and matches. Why?

Because, biologically, substitutions cannot all be considered equal.

We have to consider:

4.1.3 Substitution matrix

Substitution matrix: Assign value to each possible pair of characters

There are two main types of matrices:

PAM

PAM matrices: developed in the late 70s by looking at mutations in closely related superfamilies of amino acid sequences.

Accepted Mutation: accepted by evolution.

For the construction of PAM matrices homogeneous blocks of aligned sequences are considered.

To avoid the problem of multiple substitutions, very similar sequences are chosen to determine PAM matrix:

For each amino acid (j), count the number N_{jk} of changes into another amino acid (k)

Normalize by dividing by the total number of changes (\sum_{m} N_{jm}, 1 \leq m \leq n)

n = number of amino acids = 20

A_{jk} =\frac{N_{jk}}{\sum_m N_{jm}}

PAM contains the log-odds (p) of the transition of each amino acid into another amino acid

p = log (odd(P))

odd(P) = \frac{P}{1-P} \implies p = log (\frac{P}{1-P})

If PAM_{i,j} > 0, likely transition of i into j

If PAM_{i,j} = 0, random transition of i into j

If PAM_{i,j} < 0, unlikely transition of i into j

The classical PAM expresses the probability of change in one step \implies PAM1

If we want two steps: PAM1 * PAM1 \implies (PAM1)^2 \implies PAM2

For ten: (PAM1)^{10} = PAM10

The number is the percentage of change: PAM2 \implies 2\%

The number identifies the evolutionary step: one step corresponds to the change of one amino acid out of 100.

PAM250 is the most used; at this level amino acid sequences maintain 20\% similarity.

Example

Take PAM250(F \to Y): 0.15

Divide by the frequency of F (0.04) and take the logarithm: log_{10}(\frac{0.15}{0.04}) = 0.57

Likewise for Y \to F: log_{10} (\frac{0.2}{0.03}) = 0.82

Calculate the score for a change F,Y as 10 * \frac{(0.82 + 0.57)}{2} \approx 7
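The arithmetic of this example can be reproduced in a few lines. This is only a sketch using the four numbers given above; the function and variable names are assumptions. Evaluated without intermediate rounding, the two logarithms are about 0.574 and 0.824 and the final score is about 6.99, which rounds to 7.

```python
import math


def log_odds(transition_probability, frequency):
    """Base-10 logarithm of the transition probability over the frequency."""
    return math.log10(transition_probability / frequency)


f_to_y = log_odds(0.15, 0.04)  # F -> Y
y_to_f = log_odds(0.2, 0.03)   # Y -> F

# Symmetric score: average of the two directions, scaled by 10.
score = 10 * (f_to_y + y_to_f) / 2

print(round(f_to_y, 3), round(y_to_f, 3), round(score))
```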

We will obtain something like this:

“PAM250 log odds”

BLOSUM

BLOSUM (BLOcks SUbstitution Matrix): matrix of amino acid substitutions

A block is a highly conserved region without gaps

How is the matrix calculated?

For each pair of amino acids x and y, calculate the ratio between the observed frequency with which x and y are aligned and the likelihood (e_{xy}) that x and y are aligned by chance

Example

Values are calculated based on the substitutions in a set of 2000 conserved patterns

To prevent very similar sequences in a block from biasing the estimate, clusters are created in the block.

To find relationships between sequences that are evolutionarily close in time, a large n is used.

BLOSUM62 is the standard.

BLOSUM vs PAM

So the gap penalty is g = g_o + g_e * (l-1)

where g_o is the gap opening penalty (high),

g_e is the gap extension penalty (lower),

l is the length of the gap block.

In the course exercises we use a linear gap penalty for simplicity.

Real software uses the gap penalty defined above.

This is because in biology it is more likely that a single mutation event creates one long gap than many scattered gaps.
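A short sketch of the formula g = g_o + g_e * (l-1), with penalties expressed as positive costs. The numeric values g_o = 10 and g_e = 1 are assumptions chosen only to show the effect; they do not come from the notes.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Penalty of one gap block of the given length: g_o + g_e * (l - 1)."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


GAP_OPEN, GAP_EXTEND = 10, 1  # assumed example values

one_long_gap = affine_gap_penalty(4, GAP_OPEN, GAP_EXTEND)
four_short_gaps = 4 * affine_gap_penalty(1, GAP_OPEN, GAP_EXTEND)

print(one_long_gap)     # 10 + 1 * 3 = 13
print(four_short_gaps)  # 4 * 10 = 40
```

With these values one gap of length 4 costs much less than four separate gaps of length 1, which matches the biological motivation given above.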

4.1.4 Needleman-Wunsch

This method is an algorithm, and it is optimal.

This algorithm is also complete.

Optimal algorithm: if it finds a solution, it is the best one.

Complete algorithm: if a solution exists, the algorithm always finds one.

How does it work?

First, penalties and rewards:

Build the matrix like the dot matrix, but with an additional row and column.

Start the matrix with a 0 in the (0,0) cell, then write the scores; there are 3 movements:

We complete the matrix by continuing the score with the movement that maximizes it.

We start by filling row 0 and column 0 with “gap movements”.

After filling the matrix we have the best score in the last cell, (4,4) in this case; we then backtrack the movements and obtain the solution(s).

Solution:

ATGC \\ ATCC

Score: 5

In general we use a PAM or BLOSUM matrix for the scores.

We can obtain more than one solution; in this case we write all of them.

This algorithm is used for global alignment, i.e. to find the best alignment over the whole sequences.
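The procedure can be sketched as follows. This is a minimal illustration, assuming the simple scores of section 4.1.2 (match +2, mismatch -1, gap -2) with a linear gap penalty instead of a PAM/BLOSUM matrix; it returns only one optimal alignment, while the exercise asks for all of them.

```python
def needleman_wunsch(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Global alignment: return (score, aligned seq1, aligned seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]

    # Row 0 and column 0 contain only gap movements.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    # Each cell takes the best of the 3 movements.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # diagonal: match/mismatch
                score[i - 1][j] + gap,            # up: gap in seq2
                score[i][j - 1] + gap,            # left: gap in seq1
            )

    # Backtrack from the last cell to (0, 0).
    out1, out2 = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out1.append(seq1[i - 1]); out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out1.append(seq1[i - 1]); out2.append("-")
            i -= 1
        else:
            out1.append("-"); out2.append(seq2[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out1)), "".join(reversed(out2))


print(needleman_wunsch("ATGC", "ATCC"))  # (5, 'ATGC', 'ATCC')
```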

4.1.5 Smith - Waterman algorithm

This algorithm is for local alignment, i.e. to find the best aligned sub-sequences.

It’s identical to the Needleman-Wunsch algorithm with one crucial difference: there are no negative scores (a negative score is replaced by 0). So the same matrix above becomes:

So in this case the best sub-sequence alignment is

ATGC \\ ATCC

Score: 5

There may be multiple sub-sequences with the highest score; we have to write all of them. REMEMBER: if a score goes to zero this is a reset point, so the next score belongs to another sub-sequence; if instead a score decreases but does not reach zero, it is still the same sub-sequence, as in the example.
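A minimal sketch of the local variant, under the same assumed scores (match +2, mismatch -1, gap -2). The only changes with respect to the global algorithm are the floor at 0 and the backtracking, which starts from the highest cell and stops at the first 0. Only one best local alignment is returned here.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Local alignment: return (score, aligned part of seq1, aligned part of seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,                                # reset: no negative scores
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Backtrack from the best cell until a 0 (reset point) is reached.
    out1, out2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out1.append(seq1[i - 1]); out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out1.append(seq1[i - 1]); out2.append("-")
            i -= 1
        else:
            out1.append("-"); out2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(out1)), "".join(reversed(out2))


print(smith_waterman("ATGC", "ATCC"))  # (5, 'ATGC', 'ATCC')
```

In this example the score drops from 4 to 3 at the G/C mismatch but never reaches 0, so the whole stretch is still one sub-sequence.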

For ANY pairwise alignment, the measures used are:

For the z-score, a value \geq 5 suggests that the alignment found between the two sequences is significant.

Probability p is obtained by: p = 1 - e^{-kmne^{-\lambda S}}
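The formula can be evaluated directly. In the sketch below m and n are taken to be the lengths of the two sequences (or of the query and the database), while k and \lambda are statistical parameters that depend on the scoring system; this reading of the symbols, and the numeric values used, are assumptions added for illustration. The exponent kmne^{-\lambda S} is the expected number of alignments with score at least S.

```python
import math


def expected_hits(score, m, n, k, lam):
    """E = k * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


def p_value(score, m, n, k, lam):
    """p = 1 - exp(-E), computed with expm1 for accuracy when E is tiny."""
    return -math.expm1(-expected_hits(score, m, n, k, lam))


# Assumed example values, not taken from the notes.
K, LAMBDA = 0.1, 0.3
for s in (20, 40, 60):
    print(s, expected_hits(s, 100, 1000, K, LAMBDA), p_value(s, 100, 1000, K, LAMBDA))
```

For small E the two quantities are almost equal, and both decrease as the score S grows.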

Guide for the E score:

4.2 Database Search

The classic programs that search for sequences in databases are FASTA and BLAST

The heuristic principle that these programs use is the search for “words” in databases.

word: short series of characters in the sequences of amino acids or nucleic acids.

These words are indicated with the term k-tuple -> k = n° of characters.
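As a small illustration of what a k-tuple is, the sketch below lists all the overlapping words of length k in a sequence and records where each one occurs. Function names and the 0-based positions are assumptions for illustration.

```python
def k_tuples(sequence, k):
    """All overlapping words of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def positional_table(sequence, k):
    """Map each word of length k to the list of its start positions (0-based)."""
    table = {}
    for position, word in enumerate(k_tuples(sequence, k)):
        table.setdefault(word, []).append(position)
    return table


print(k_tuples("ATGCA", 2))          # ['AT', 'TG', 'GC', 'CA']
print(positional_table("ATATG", 2))  # {'AT': [0, 2], 'TA': [1], 'TG': [3]}
```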

4.2.1 FASTA

FASTA (FAST-All) is a heuristic program that searches for global homology between sequences.

There are two variants that search for local homology:

FASTA is specific but not quite sensitive.

4 phases:

Initially, create a positional table containing all the positions of each amino acid (or nucleotide) in the query sequence and in each sequence in the database

The table can be built by considering the positions individually (1-tuple) or in pairs (2-tuple); for nucleotides, 4-tuples or 6-tuples are used.

Calculate the difference between the positional values of each amino acid in the query and in the database.

The best 10 regions (best 10 sub-sequences) selected are evaluated through the score matrices, and the sub-regions containing the residues that maximize the region score are identified.

The aim is to find the initial region with the best score, to be used to rank the sequences in the database, in order to define which of them are the most similar to the query sequence.

FASTA evaluates if it’s possible to join together different regions of similarity.

Constraints to create the join:

Sequences with higher similarity are aligned to the query sequence using the procedure based on a modified Smith-Waterman algorithm \implies optimized score (OPT)

4.2.2 BLAST

BLAST (Basic Local Alignment Search Tool)

It searches for best local alignment between a query sequence and the sequences in a database

Features:

While FASTA searches all possible words of the same length, BLAST limits the search to the most significant words using a preventive filter.

For score, in case of protein it uses BLOSUM62

BLAST fixes the length of the word to:

3 Phases:

A list of words of length W is generated from the query sequence.

For each word, a score is assigned against each of the 20^3 possible words.

A threshold T is used to limit the number of analogous words.

An (exact) search for the best analogous words is performed in the sequences of the database.

When the analogous words are found in the database sequences, they identify regions of possible local alignment (without gaps) between the query sequence and the sequences found in the database

The algorithm tries to extend the aligned regions, without allowing gaps, as long as the extended alignment score does not decrease \implies High-scoring Segment Pairs (HSP)

An HSP is considered relevant if it exceeds a threshold value S.

Important: at the end, the best alignment is generated according to the Smith-Waterman algorithm.

Variation of E and p

p-value

Filters

BLAST vs FASTA

Both heuristic

They don’t guarantee finding the best alignment

Variants of BLAST

4.3 Information

Motifs are regular combinations of protein secondary structures associated with particular functions.

So same motif \implies similar function

Search for protein motifs can identify new genes and study the diffusion of specific motifs in different genomes.

Uses of protein motif search

4.4 Multiple Alignment

Why ?

The alignment in pair allows:

The multiple alignment allows:

Formal definition:

A multiple alignment associates with S_1, ..., S_k the sequences S_1', ..., S_k' : S_i' \in (\Sigma \cup \{-\})^* for 1 \leq i \leq k so that:

Profiles

Profiles are useful structures for summarizing the common properties of groups of sequences and they are the basis of many methods of multiple sequence alignment

Example:

Shannon Entropy

Given a probability space (S,p), the entropy H is a measure of the dispersion of the probability function over the objects in the space S

\begin{align*} H = - \sum_{i=1}^m p_i log_2 p_i \end{align*}

Information content

Given a matrix of weights that models a sequence alignment, you can determine the information content I(k) for each alignment position k:

\begin{align*} I(k) = log_2(m) - (H(k) + e(n)) \end{align*}

The alignment logo shows the information content of each position of the multiple alignment.

Usefulness of profile extraction

Databases of profiles/patterns:

To align a sequence to a profile, we use Needleman-Wunsch but with a different scoring function.

\begin{align*} \sigma_{sp} (b,i) = \sum_{a \in \Sigma} P_{i,a} \sigma (a,b) \end{align*}

To align two profiles -> \sigma_{pp} (i,j) = \sum_{k=1}^{|\Sigma| + 1} f(P_{i,k}', P_{j,k}'')

Different multiple alignments are possible \implies we need a standard score to be able to compare them:

The most used function is the Sum-of-Pairs score, the sum of the scores of the pairwise alignments induced by the multiple alignment:

\begin{align*} \sigma (s) = \sum_{i=1}^{i<z} \sum_{j=i+1}^z S(s_i, s_j) \end{align*}

S(s_i,s_j) is the score of alignment of pairs of sequences s_i and s_j induced by multiple alignment M.

z is the number of sequences in the multiple alignment.

Example

S1: ACTCT \\ S2: A-TTT \\ S3: A-TTT \\ \sigma(s) = S(s_1,s_2) + S(s_1,s_3) + S(s_2, s_3) = 3 + 3 + 6 = 12
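The example can be reproduced with a few lines of code. The pair scores are not stated next to the example; the sketch below assumes the values of section 4.1.2 (match +2, mismatch -1, gap -2) and, as a further assumption, scores a column where both rows have a gap as a gap (-2), since this is the choice that reproduces 3 + 3 + 6 = 12.

```python
from itertools import combinations


def pair_score(row1, row2, match=2, mismatch=-1, gap=-2):
    """Score of the pairwise alignment induced by two rows of the alignment."""
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap  # assumption: a gap/gap column also counts as a gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def sum_of_pairs(alignment):
    """Sum of pair_score over all unordered pairs of rows."""
    return sum(pair_score(a, b) for a, b in combinations(alignment, 2))


alignment = ["ACTCT", "A-TTT", "A-TTT"]
print(sum_of_pairs(alignment))  # 3 + 3 + 6 = 12
```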

Other scoring functions:

Entropy

H(A) = \sum_{c \in A} H(c)

H(c) = - (\sum_{x \in \Sigma} p_x log_2 p_x)

c: column of the alignment A

p_x: frequency of the symbol x in column c.

Example

ACT \\ ACA \\ A-T \\ H(1) = - (\frac{3}{3} log_2 \frac{3}{3} + 0 + 0 + 0 + 0) = 0 \\ H(2) = - (0 + \frac{2}{3} log_2 \frac{2}{3} +0 + 0 + \frac{1}{3} log_2 \frac{1}{3}) = 0.92 \\ H(3) = - (\frac{1}{3} log_2 \frac{1}{3} + 0 + 0 + \frac{2}{3} log_2 \frac{2}{3} + 0) = 0.92 \\ H(A) = 0 + 0.92 + 0.92 = 1.84
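The same computation as a short sketch, treating the gap as one more symbol of the column exactly as in the example above. Function names are assumptions for illustration.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (base 2) of the symbol frequencies in one column."""
    total = len(column)
    entropy = 0.0
    for count in Counter(column).values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def alignment_entropy(rows):
    """H(A): sum of the entropies of all the columns."""
    return sum(column_entropy(column) for column in zip(*rows))


rows = ["ACT", "ACA", "A-T"]
print([round(column_entropy(c), 2) for c in zip(*rows)])  # [0.0, 0.92, 0.92]
print(round(alignment_entropy(rows), 2))                  # 1.84
```

A lower total entropy means more conserved columns, so when comparing alignments with this function the lower value is the better one.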

Circular Sum

CS(A) = \frac{1}{2} \sum_{i=1}^z MPA(a_i, a_{i+1}), with a_{z+1} = a_1

We compute the pairwise scores directly, using:

match: +1

mismatch/gap: -1

ACA \\ ACC \\ AT- \\ MPA(a_1,a_2) = 1 + 1 - 1 = 1 \\ MPA(a_2,a_3) = 1 - 1 - 1 = -1 \\ MPA(a_3,a_1) = 1 - 1 - 1 = -1 \\ CS(A) = \frac{1}{2}(1 - 1 - 1) = -\frac{1}{2}
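A sketch of the circular sum with the scores used here (match +1, mismatch/gap -1). The rows are visited in a circle, so the last row is compared with the first one. Evaluating the formula literally on the three rows above gives 1/2 * (1 - 1 - 1) = -0.5. Function names are assumptions for illustration.

```python
def mpa(row1, row2, match=1, other=-1):
    """Pairwise score of two rows: +1 for a match, -1 for a mismatch or a gap."""
    return sum(
        match if a == b and a != "-" else other
        for a, b in zip(row1, row2)
    )


def circular_sum(alignment):
    """CS(A) = 1/2 * sum of MPA(a_i, a_{i+1}), with the last row paired to the first."""
    z = len(alignment)
    return 0.5 * sum(mpa(alignment[i], alignment[(i + 1) % z]) for i in range(z))


alignment = ["ACA", "ACC", "AT-"]
print([mpa(alignment[i], alignment[(i + 1) % 3]) for i in range(3)])  # [1, -1, -1]
print(circular_sum(alignment))                                        # -0.5
```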

Sum-of-Pair vs Circular Sum

Sum-of-pair is clearly inefficient from an evolutionary point of view

4.4.1 Dynamic programming

Now, if we have 2 sequences we need a 2D matrix, for 3 sequences a 3D matrix, for n sequences an n-dimensional matrix. This approach is very expensive: the problem is NP-complete (NP = nondeterministic polynomial time) \implies very difficult and very time-consuming.

Example: 10 sequences, each of length 100, give 100^{10} = 10^{20} elements \implies 100 million terabytes

Solution \implies Heuristic and approximations.

Heuristic methods:

4.4.2 Progressive alignment

Simple and the most common

Idea: we align 2 sequences, then we align another 2, and so on; then we align the 2 resulting alignments with each other, with another pair, or with an unaligned sequence.

Like MERGE-SORT

Heuristic: similarity degree

So we align the most similar ones first, until no unaligned sequence remains.

Feng - Doolittle

Algorithm that implements progressive alignment heuristics:

4.4.3 Star-Center

Given a set S of z sequences, we define central sequence S_C \in S the sequence that minimizes the function:

\sum_{S_j \in S} D(S_C, S_j)

that is, the sum of the distances of all the sequences from S_C is the minimum possible.
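Choosing the central sequence is a direct minimization. The notes do not fix the distance D; the sketch below assumes, only for illustration, sequences of equal length and the Hamming distance (number of differing positions). In practice D would come from pairwise alignments.

```python
def hamming_distance(s1, s2):
    """Number of positions where two equal-length sequences differ (assumed D)."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s1, s2))


def center_sequence(sequences, distance=hamming_distance):
    """Sequence S_C that minimizes the sum of distances to all the sequences."""
    return min(sequences, key=lambda c: sum(distance(c, s) for s in sequences))


sequences = ["ACGT", "ACGA", "TCGA"]
# Sums of distances: ACGT -> 1 + 2 = 3, ACGA -> 1 + 1 = 2, TCGA -> 2 + 1 = 3
print(center_sequence(sequences))  # ACGA
```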

Then we use Sum-of-Pairs

4.4.4 Iterative Alignment

It starts by aligning the nearest pair of sequences according to a certain definition of distance (not the same as in progressive alignment).

Then, at each step it takes the sequence with the minimum distance from all the sequences already aligned and aligns it to the alignment profile already created.

If needed, new gaps "-" are created.

4.5 Multiple alignment tool

ClustalW is the most popular tool for the multiple alignment.

Then it builds a phylogenetic tree: it considers the first pair as a single sequence, builds another similarity matrix, and goes on until the tree is finished.

The result is a tree with branches of length proportional to the distance between the sequences \implies dendrogram

Details of ClustalW’s output

At the bottom of each column:

Chapter Five: Measurement of Genetic Expression

Gene expression: conversion of the information encoded in a gene, for coding genes, first into messenger RNA and then into protein.

Not every gene is always necessary for the life of the cell

Gene expression is regulated by the needs of the cell: environment conditions and functions to be performed.

Gene expression differs depending on the cell type and on the response to the environment.

The transcriptome is the complete set of gene transcripts and of their levels of expression, in a particular type of cells or tissue, in well defined conditions.

To understand biological organisms it is necessary to study:

Systems biology: study of the interactions between the components of a biological system and of how such interactions give rise to the functions and behaviour of the system.

For functional analysis of genomes:

High - throughput procedures.

These approaches of genotyping must be correlated with phenotypic analysis of model organisms and cells in vitro.

5.1 Gene expression analysis techniques

How to measure the gene expression ?

Methods to measure the expression level of one or a few genes at a time: RT-PCR (Reverse Transcriptase Polymerase Chain Reaction)

Main analysis techniques of the whole transcriptome:

1980: RNA analysis of one or few genes at a time

1995: RNA analysis whole genome

Two main technologies of DNA microarrays:

5.1.1 Northern Blot

Laboratory technique to study genetic expression, by finding the RNA (or isolated mRNA) in a sample

In 4 steps:

  1. RNA extraction: we extract the RNA.
  2. Preparation of the probe: a fragment of the gene that we want to analyze.
  3. Hybridization: we wait until the probe and the RNA of the gene bind to each other, if the gene is expressed.
  4. Detection: the probe carries a marker (radioactive or fluorescent) and its intensity is proportional to the quantity of expressed gene.

5.1.2 RT-procedure

The polymerase chain reaction (PCR) is a laboratory technique that exploits DNA replication to amplify one or a few copies of a specific DNA sequence, up to \cong 10 kb long, or even 40 kb.

PCR is based on thermal cycles of heating and cooling of a solution in which the DNA replication reaction occurs: a high temperature is used to separate the strands of the DNA helix, and a lower temperature for the replication of the DNA.

The reverse transcription polymerase chain reaction (RT-PCR) is a variation of PCR in which an RNA strand is first reverse-transcribed into its complementary DNA (cDNA) by the enzyme reverse transcriptase; the resulting cDNA is then amplified by traditional PCR, or real-time PCR, carried out in a thermal cycler for automatic time and temperature control.

Another way to replicate pieces of DNA uses bacterial plasmids as vectors to clone DNA sequences: the DNA fragment to be cloned is inserted into the DNA sequence of the plasmid, and the enzyme DNA ligase binds it to the plasmid DNA -> recombinant plasmid.

5.2 DNA microarrays

Microarrays: ordered and miniaturized arrangements of DNA fragments with known sequences on a solid support.

Microarrays are the evolution of the Northern blot: microarrays can analyze the entire genome, while the Northern blot analyzes only one or a few genes.

Application:

Since they make it possible to determine the expression profile of the cell in a given state, microarrays are also said to allow expression profiling

5.2.1 cDNA microarrays

4 steps:

  1. Building of the cDNA microarray: a full selection of ESTs (Expressed Sequence Tags, short sub-sequences of a transcribed cDNA sequence).
  2. Sample preparation: two mRNA samples are prepared, reverse-transcribed into cDNA and made fluorescent with different colors (Cy3, green, used for the control; Cy5, red, used for the test).
  3. Hybridization: the gene transcripts expressed in the samples, prepared and labeled, are hybridized with their complementary sequences on the microarray
  4. Measure of the gene expression: the fluorescence measured in every spot gives a measure of which genes are expressed in each of the two samples.

Images of cDNA microarray:

From images to data: a laser reads the intensity of each spot and transforms it into data.

Pros and Cons of “spotted” technology:

5.2.2 Oligonucleotide microarrays

Written by: Niccolò Papini